
TRAJECT-Bench:A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

He, Pengfei, Dai, Zhenwei, He, Bing, Liu, Hui, Tang, Xianfeng, Lu, Hanqing, Li, Juanhui, Ding, Jiayuan, Mukherjee, Subhabrata, Wang, Suhang, Xing, Yue, Tang, Jiliang, Dumoulin, Benoit

arXiv.org Artificial Intelligence

Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing works evaluate LLMs' tool-use capability, they largely focus on final answers and overlook the detailed tool-usage trajectory, i.e., whether tools are selected, parameterized, and ordered correctly. We introduce TRAJECT-Bench, a trajectory-aware benchmark that comprehensively evaluates LLMs' tool-use capability through diverse tasks with fine-grained evaluation metrics. TRAJECT-Bench pairs high-fidelity, executable tools across practical domains with tasks grounded in production-style APIs, and synthesizes trajectories that vary in breadth (parallel calls) and depth (interdependent chains). Beyond final accuracy, TRAJECT-Bench also reports trajectory-level diagnostics, including tool selection and argument correctness, and dependency/order satisfaction. Analyses reveal failure modes such as confusion between similar tools and parameter-blind selection, and characterize scaling behavior with tool diversity and trajectory length, exposing a bottleneck in the transition from short to mid-length trajectories and offering actionable guidance for LLMs' tool use.
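
As an illustration of what trajectory-level diagnostics can look like, the Python sketch below scores a predicted tool-call sequence against a gold one for tool selection, argument correctness, and dependency/order satisfaction. The data structures and metric definitions are simplified assumptions for illustration, not TRAJECT-Bench's actual implementation.

# Minimal sketch of trajectory-level diagnostics in the spirit of TRAJECT-Bench.
# The data structures and metric names are illustrative assumptions,
# not the benchmark's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str   # tool name selected by the agent
    args: dict  # keyword arguments passed to the tool

def trajectory_diagnostics(predicted: list[ToolCall],
                           gold: list[ToolCall],
                           dependencies: list[tuple[int, int]]) -> dict:
    """Score a predicted trajectory against a gold one.

    dependencies: (i, j) pairs meaning gold call i must precede gold call j.
    """
    # Tool selection: fraction of gold tools that appear anywhere in the prediction.
    gold_tools = [c.tool for c in gold]
    pred_tools = [c.tool for c in predicted]
    selection = sum(t in pred_tools for t in gold_tools) / len(gold_tools)

    # Argument correctness: exact match of tool and args at each aligned position.
    matched = [int(p.tool == g.tool and p.args == g.args)
               for p, g in zip(predicted, gold)]
    arg_correct = sum(matched) / len(gold)

    # Dependency/order satisfaction: required orderings preserved in the prediction.
    def first_index(tool: str) -> int:
        return pred_tools.index(tool) if tool in pred_tools else -1

    satisfied = 0
    for i, j in dependencies:
        a, b = first_index(gold_tools[i]), first_index(gold_tools[j])
        satisfied += int(a != -1 and b != -1 and a < b)
    order = satisfied / len(dependencies) if dependencies else 1.0

    return {"tool_selection": selection,
            "argument_correctness": arg_correct,
            "order_satisfaction": order}

A harness like this reports three separate numbers per trajectory rather than a single pass/fail, which is what allows failure modes such as correct tool choice with wrong arguments to be distinguished from wrong ordering.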


LLM-Symbolic Integration for Robust Temporal Tabular Reasoning

Kulkarni, Atharv, Dixit, Kushagra, Srikumar, Vivek, Roth, Dan, Gupta, Vivek

arXiv.org Artificial Intelligence

Temporal tabular question answering presents a significant challenge for Large Language Models (LLMs), requiring robust reasoning over structured data, which is a task where traditional prompting methods often fall short. These methods face challenges such as memorization, sensitivity to table size, and reduced performance on complex queries. To overcome these limitations, we introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations, alongside a symbolic intermediate representation that transforms tables into database schemas. This structured approach allows LLMs to generate and execute SQL queries, enhancing generalization and mitigating biases. By incorporating adaptive few-shot prompting with contextually tailored examples, our method achieves superior robustness, scalability, and performance. Experimental results consistently highlight improvements across key challenges, setting a new benchmark for robust temporal reasoning with LLMs.
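
To make the described pipeline concrete, the sketch below loads a tiny tenure table into an in-memory SQLite schema and executes a SQL query of the kind an LLM would be prompted to generate for a temporal question. The table, schema, question, and query are illustrative placeholders, not drawn from TempTabQA-C.

# Minimal sketch of a symbolic table-to-SQL pipeline: the table becomes a database
# schema and an (assumed) LLM-generated SQL query is executed against it.
import sqlite3

rows = [
    ("Alice", "CEO", 2015, 2019),
    ("Bob",   "CEO", 2019, 2023),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenure (name TEXT, role TEXT, start_year INT, end_year INT)")
conn.executemany("INSERT INTO tenure VALUES (?, ?, ?, ?)", rows)

# Question: "Who was CEO in 2020?"  An LLM would be prompted (with contextually
# tailored few-shot examples) to emit SQL over the schema rather than answering
# directly from the raw table text.
llm_generated_sql = """
SELECT name FROM tenure
WHERE role = 'CEO' AND start_year <= 2020 AND end_year > 2020
"""
print(conn.execute(llm_generated_sql).fetchall())  # -> [('Bob',)]

Executing the query symbolically keeps the temporal comparison outside the model, which is the mechanism the abstract credits for reduced memorization effects and better robustness to table size.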


Can Knowledge Editing Really Correct Hallucinations?

Huang, Baixiang, Chen, Canyu, Xu, Xiongxiao, Payani, Ali, Shu, Kai

arXiv.org Artificial Intelligence

Large Language Models (LLMs) suffer from hallucinations, referring to the nonfactual information in generated content, despite their superior capacities across tasks. Meanwhile, knowledge editing has been developed as a new popular paradigm to correct the erroneous factual knowledge encoded in LLMs with the advantage of avoiding retraining from scratch. However, one common issue of existing evaluation datasets for knowledge editing is that they do not ensure LLMs actually generate hallucinated answers to the evaluation questions before editing. When LLMs are evaluated on such datasets after being edited by different techniques, it is hard to directly adopt the performance to assess the effectiveness of different knowledge editing methods in correcting hallucinations. Thus, the fundamental question remains insufficiently validated: Can knowledge editing really correct hallucinations in LLMs? We propose HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics, and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods in a holistic way on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we have provided new insights into the potentials and limitations of different knowledge editing methods in correcting hallucinations, which could inspire future improvements and facilitate the progress in the field of knowledge editing. Considering the high cost of retraining LLMs from scratch, knowledge editing has been designed as a new paradigm to correct erroneous or outdated factual knowledge, which is the motivation of applying knowledge editing to LLMs. To better illustrate this point, following the evaluation setting in (Zhang et al., 2024e), we conducted a preliminary study to examine the pre-edit and post-edit performances of Llama2-7B on the aforementioned datasets. [Table 1: Performance measured by Accuracy (%) of Llama2-7B before editing ("Pre-edit") and after applying typical knowledge editing methods ("Post-edit") on common existing evaluation datasets. When such datasets are adopted to evaluate the performance of LLMs after being edited, it is hard to directly use the scores to judge the effectiveness of different knowledge editing techniques in correcting hallucinations. Figure example question: "Who is the Chief Scientist of OpenAI?"]
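
The sketch below shows one simple way such a multi-dimensional evaluation could be wired up: each dimension gets its own question set, and an edited model's answers are scored per dimension. The data format, the `answer` callable, and the exact-match judge are assumptions for illustration; HalluEditBench's actual protocol may differ.

# Minimal sketch of scoring an edited model along the five dimensions named above
# (Efficacy, Generalization, Portability, Locality, Robustness). Placeholder data
# format; not HalluEditBench's actual evaluation code.
from typing import Callable

def dimension_accuracy(answer: Callable[[str], str],
                       questions: list[tuple[str, str]]) -> float:
    """Fraction of (question, expected_answer) pairs the edited model gets right."""
    correct = sum(answer(q).strip().lower() == a.strip().lower() for q, a in questions)
    return correct / len(questions) if questions else 0.0

def evaluate_edit(answer: Callable[[str], str],
                  suites: dict[str, list[tuple[str, str]]]) -> dict[str, float]:
    # suites maps each dimension to its own question/answer pairs, e.g. rephrased
    # questions for Generalization, or unrelated facts that must remain unchanged
    # for Locality.
    return {dim: dimension_accuracy(answer, qa) for dim, qa in suites.items()}

Keeping the five dimensions as separate suites is what lets a benchmark report, for example, that an editing method fixes the target fact (high Efficacy) while damaging unrelated knowledge (low Locality).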


Adoption of Bots Across the Insurance Value Chain

#artificialintelligence

Today, customers expect their queries to be answered on their terms and as quickly as possible. What are the most significant factors in creating a superior customer experience? A PwC study states that nearly 80% of US consumers rate the above-stated factors as the most important elements of a positive customer experience, and customers outside the US value them even more. Adoption of AI bot technology in insurance brings a "human touch" and helps insurers "build real connections" with their customers without frustrating them.


ICATT hosts business forum on artificial intelligence

#artificialintelligence

The Institute of Chartered Accountants of Trinidad and Tobago (ICATT) earlier this month hosted a business forum for an audience of financial executives from various sectors, including energy, banking and finance, at the KPMG Headquarters in Port of Spain. The event, themed "Artificial Intelligence (AI) – the Future of Accounting", exposed professional accountants to global developments, good-practice guidance and knowledge-sharing that will enhance their roles and domain across the economy. In delivering the opening remarks, ICATT's president, Stacy-Ann Golding, praised the ICATT Professional Accountants in Business (PAIB) Committee for organising the forum, the topic of which, she noted, was critical to improving the readiness of today's accounting professionals to deal with AI and its implications. Bringing a depth of insight and experience were featured speakers Nigel Romano, managing director and chief executive officer of JMMB Bank, and Leslie Lee Fook, director of Artificial Intelligence, Automation and Analytics at Incus Services Ltd. Speaking on the use of AI, Romano said: "I can recall the now obsolete, clunky computerised systems used in accounting during the 1970s and how they helped speed up work processes at that time. Today a similar shift is happening, as current systems will soon be overshadowed by those powered by self-learning / machine learning capabilities."


SilverHook gains edge with high-tech AI in race to the podium

#artificialintelligence

Last year, after breaking the Guinness World Record for the Key West to Cuba run, we wondered what was next for the #77 Lucas Oil SilverHook ocean racing powerboat. We found the answer in the 50th anniversary of the Trinidad & Tobago Great Race, one of the most grueling races in the world. The 115-mile endurance course starts in Trinidad's Port of Spain, where you head north and then east near the island before popping out into the Atlantic Ocean for a 50-mile sprint to the finish in Store Bay, Tobago. Because of the logistical difficulties of racing on foreign shores, we were the first American entry in 29 years. We knew we would face stiff competition from Jumbie, Cat Killer, Mr. Solo and other local rivals that know the course well.